Basic text analytics, also known as text mining, can be used to glean information from unstructured sources such as books or prose. After watching a YouTube video (Moffit 2020) arguing that color terms enter a literature in a specific order, I wondered whether we could use text analytics to determine approximately when in a civilization's history a work was created. Expanding on that, perhaps we could identify the specific period in which a work was created from a thumbprint derived from analyzing its text as data. One of the works cited in the video was “The Iliad”, in which the word blue is reportedly never used, so I performed a quick analysis of this work.
Immediately, a problem was encountered: the initial analysis showed that translators had not been faithful to the original text!
Illiad %>% colorWords()
## # A tibble: 6 x 2
##   word       n
##   <chr>  <int>
## 1 black     99
## 2 blue      14
## 3 green     12
## 4 red       16
## 5 white     50
## 6 yellow    10
This led to the question of whether I can judge the reliability of a translation by applying text analytics to a set of commonly used adjectives. I chose adjectives because they are descriptive and less likely than verbs or nouns to take on idiomatic meanings beyond the literal.
I would construct a word frequency analysis and propose a rule of measure. The words used would be common adjectives such as colors, sizes, and distances. We would then attempt to develop a distance metric to show how ‘close’ a given translation is to the original text. The original text will be the Aeneid by Virgil, a Latin epic about the mythic origins of the Roman people after the fall of Troy, a sequel of sorts to Homer's Iliad and Odyssey.
Loading our Libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
library(hunspell)
library(glue)
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
## collapse
library(gutenbergr)
library(ggplot2)
#uncomment and run once if installing
#devtools::install_github("ricardo-bion/ggradar", dependencies = TRUE)
#devtools::install_github("cardiomoon/ggiraphExtra")
library(ggiraphExtra)
library(ggiraph)
library(ggradar)
First we must prepare the system itself for the languages we will be using: Latin and English. To have usable data, we need to be able to process these languages in some way. We are performing text analytics to answer our questions. Of the seven basic operations of text analytics (Mohler 2020), we will be performing tokenization, plus stemming, which here stands in for part-of-speech tagging and syntax chunking.
To handle tokenization we rely on the tidytext library. Tidytext also performs some pre-processing for us: when processing words it converts the input to lowercase UTF-8 text. It then acts as the tokenizer, allowing us to split a text source into a list of separate words.
Stemming refers to identifying the root of a particular word. This is important for our problem because Latin is a heavily inflected language: a single word can have dozens or even hundreds of forms depending on properties such as number, gender (masculine, feminine, and neuter), case, and degree (and, for verbs, voice, tense, and person). The hunspell library and the language reference files for Latin are installed to handle this operation; the Latin reference is stored in the Data folder of this project, while the English dictionary is already part of the standard installation. The hunspell library provides access to a variety of NLP functions and is commonly used as a real-time spell checker in modern computer systems.
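As a quick illustration of the stemming operation, here is a minimal sketch using hunspell's default English dictionary (the exact stems returned depend on the dictionary files installed):
hunspell_stem(c("walked", "walking", "walks"))
# returns a list with one character vector of candidate stems per input word,
# typically including the base form "walk"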
To create a way to compare different works, we will create a list of 21 common adjectives and a concordance mapping their stemmed Latin forms to English. This will allow the tokenized and stemmed original-language work to be compared against our concordance.
#let me send in lists of words as space separated strings for convenience
qq <- function(alist) { strsplit(alist, " ", fixed = TRUE)[[1]] }
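# For example (assuming the helper above):
#   qq("white black red")  returns the character vector c("white", "black", "red")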
# Basic word frequency analysis on a provided text
colorwords2 = function(df) {
wordhist <- df %>% unnest_tokens(word,text) %>%
filter( word %in% c("white","black","red","yellow","green","blue","orange","indigo", "violet" ) ) %>% count(word)
wordhist }
First I'll test the stemming functionality against a word and several of its known forms: in this case, the Latin word ‘bonus’, meaning good.
# Load up the Latin Hunspell
latin = dictionary("Data/la_LA.dic")
#TESTING the library
w = tibble ( verba = qq(
"bonus bonum boni bonas bonis bonissimus bona bonae bonas" ) )
df <- w %>% mutate(
stem= hunspell_stem(verba, dict=latin )
) %>% unnest(stem)
w
df
From testing we run into a problem: there is ambiguity in the returned results, that is, not all of the stems are identical. In this case the ambiguity arises from the Roman alphabet itself; the letters ‘V’ and ‘U’ are interchangeable in Latin script.
We will ignore this for now and construct a concordance of stems associated with the adjectives we wish to use in our analysis. Since, in this example, the duplicates are all generated by alternate spellings of the same word, there is no problem in using them as potential stems for the concordance. We will use a Latin-English dictionary (Petterson and Rosengren 2020; Foster 2018) and a grammar reference (Krüger 2020) to help us remove unrelated stems where they occur.
First we selected 21 common adjective words. These will form the basis of our concordance. I chose adjectives as they are descriptive and less likely to have a different idiomatic meaning in a language than verbs or nouns might. Simple adjectives are also more likely to be common across a number of translations.
## [1] "black" "white" "red" "yellow" "green" "blue"
## [7] "orange" "purple" "kind" "right" "left" "bad"
## [13] "good" "diligent" "many" "only" "none" "one"
## [19] "swift" "strong" "old"
I retrieved data from a declension generator (Krüger 2020) for each of our selected words. There was no API available, so the HTML was cut and pasted into word.declension.raw files in the project Data folder. I then created a function to read these raw text files and parse out the desired data.
Here is the raw text of one such file:
## [1] "\r\nSINGULAR\r\nm\tf\tn\r\nNom.\r\nater\tatra\tatrum\r\nGen.\r\natri\tatrae\tatri\r\nDat.\r\natro\tatrae\tatro\r\nAcc.\r\natrum\tatram\tatrum\r\nAbl.\r\natro\tatra\tatro\r\nPLURAL\r\nm\tf\tn\r\nNom.\r\natri\tatrae\tatra\r\nGen.\r\natrorum\tatrarum\tatrorum\r\nDat.\r\natris\tatris\tatris\r\nAcc.\r\natros\tatras\tatra\r\nAbl.\r\natris\tatris\tatris\r\n\r\nComparative\r\nAdverb: atrius\r\nSINGULAR\r\nm, f\tn\r\nNom.\r\natrior\tatrius\r\nGen.\r\natrioris\tatrioris\r\nDat.\r\natriori\tatriori\r\nAcc.\r\natriorem\tatrius\r\nAbl.\r\natriore\tatriore\r\nPLURAL\r\nm, f\tn\r\nNom.\r\natriores\tatriora\r\nGen.\r\natriorum\tatriorum\r\nDat.\r\natrioribus\tatrioribus\r\nAcc.\r\natriores\tatriora\r\nAbl.\r\natrioribus\tatrioribus\r\n\r\nSuperlative\r\nAdverb: aterrime\r\nSINGULAR\r\nm\tf\tn\r\nNom.\r\naterrimus\taterrima\taterrimum\r\nGen.\r\naterrimi\taterrimae\taterrimi\r\nDat.\r\naterrimo\taterrimae\taterrimo\r\nAcc.\r\naterrimum\taterrimam\taterrimum\r\nVoc.\r\naterrime\taterrima\taterrimum\r\nAbl.\r\naterrimo\taterrima\taterrimo\r\nPLURAL\r\nm\tf\tn\r\nNom.\r\naterrimi\taterrimae\taterrima\r\nGen.\r\naterrimorum\taterrimarum\taterrimorum\r\nDat.\r\naterrimis\taterrimis\taterrimis\r\nAcc.\r\naterrimos\taterrimas\taterrima\r\nVoc.\r\naterrimi\taterrimae\taterrima\r\nAbl.\r\naterrimis\taterrimis\taterrimis"
This raw file is then converted to string containing all forms of the word and then converted to a list of words.
## [1] "ater atra atrum atri atrae atri atro atrae atro atrum atram atrum atro atra atro atri atrae atra atrorum atrarum atrorum atris atris atris atros atras atra atris atris atris atrior atrius atrioris atrioris atriori atriori atriorem atrius atriore atriore atriores atriora atriorum atriorum atrioribus atrioribus atriores atriora atrioribus atrioribus aterrimus aterrima aterrimum aterrimi aterrimae aterrimi aterrimo aterrimae aterrimo aterrimum aterrimam aterrimum aterrime aterrima aterrimum aterrimo aterrima aterrimo aterrimi aterrimae aterrima aterrimorum aterrimarum aterrimorum aterrimis aterrimis aterrimis aterrimos aterrimas aterrima aterrimi aterrimae aterrima aterrimis aterrimis aterrimis"
## [1] "ater" "atra" "atrum" "atri" "atrae"
## [6] "atri" "atro" "atrae" "atro" "atrum"
## [11] "atram" "atrum" "atro" "atra" "atro"
## [16] "atri" "atrae" "atra" "atrorum" "atrarum"
## [21] "atrorum" "atris" "atris" "atris" "atros"
## [26] "atras" "atra" "atris" "atris" "atris"
## [31] "atrior" "atrius" "atrioris" "atrioris" "atriori"
## [36] "atriori" "atriorem" "atrius" "atriore" "atriore"
## [41] "atriores" "atriora" "atriorum" "atriorum" "atrioribus"
## [46] "atrioribus" "atriores" "atriora" "atrioribus" "atrioribus"
## [51] "aterrimus" "aterrima" "aterrimum" "aterrimi" "aterrimae"
## [56] "aterrimi" "aterrimo" "aterrimae" "aterrimo" "aterrimum"
## [61] "aterrimam" "aterrimum" "aterrime" "aterrima" "aterrimum"
## [66] "aterrimo" "aterrima" "aterrimo" "aterrimi" "aterrimae"
## [71] "aterrima" "aterrimorum" "aterrimarum" "aterrimorum" "aterrimis"
## [76] "aterrimis" "aterrimis" "aterrimos" "aterrimas" "aterrima"
## [81] "aterrimi" "aterrimae" "aterrima" "aterrimis" "aterrimis"
## [86] "aterrimis"
Finally, the forms are stemmed by the hunspell library and converted into a mini-concordance:
english = "black"
verba <- tibble(verba = qq(declension)) %>%
  mutate(LatinStem = hunspell_stem(verba, dict = latin)) %>%
  unnest(LatinStem) %>%
  select(LatinStem) %>%
  distinct() %>%
  mutate(English = glue({english}))
verba
In this case, for the word ater, I know that neither atrium nor atrivm is a valid form of this word, so we remove those rows and then add the remainder to our concordance.
verba <- verba %>% filter(!LatinStem %in% qq("atrium atrivm"))
concordance <- verba
concordance
I introduce a function to do this cleanup work for each word. It is still necessary to inspect each mini-concordance before adding its rows to the main concordance.
# Regular expression excluding all unwanted text from a declension table
# generated at https://latin.cactus2000.de/noun/shownoun_en.php?n=generator
# "^m|\sf\s|n$|,|\.|^.\s+$|Nom|Gen|Dat|Acc|Abl|Voc|SINGULAR|PLURAL|^Adverb.*$|Superlative|Comparative",""
string_declension <- function (verbum) {
declension <- read_file(glue ("./Data/{verbum}.declension.raw"))
#get rid of line control characters
declension<- str_replace_all(declension, ".*Adverb:.*\\r\\n","")
declension <- str_replace_all(declension, "\\r\\n|\\t|,"," ")
declension <- str_replace_all(declension, "\\s+|\\r\\n|\\t|,|/|-"," ")
#get rid of unwanted characters from html table
declension <- str_replace_all(declension, "\\bm\\b|\\bf\\b|\\bn\\b|\\.|^.\\s+$|Nom|Gen|Dat|Acc|Abl|Voc|SINGULAR|PLURAL|Superlative|Comparative","")
declension <- str_replace_all(declension, "\\s+"," ")
declension <- str_replace_all(declension, "^\\s|\\s$","")
}
get_declension <- function (verbum, english, forms=NULL) {
if (is.null(forms)){
declension <- string_declension(glue({verbum}))
} else {
declension <- forms
}
w <- tibble( verba = qq(declension)) %>% mutate(LatinStem = hunspell_stem(verba,dict = latin)) %>% unnest(LatinStem) %>% select(LatinStem ) %>% distinct() %>% mutate(English =glue({english}))
}
The selected words for our concordance are :
black, white, red, yellow, green, blue, purple, orange
And a collection of common Latin adjectives: kind, right (direction), left, bad, good, diligent, many, only, none, one, swift, strong, old.
This approach could later be expanded by including synonyms in each language of interest, and some NLP-derived context could assist in discriminating a bit further.
We now proceed to process the declensions (lists of inflected forms) for each word and add them to our concordance.
# Examine w , remove invalid stems , mutate to add the english word to concordance
niger <- get_declension("niger","black")
niger
This word has no ambiguous forms
concordance <- add_row(concordance,niger)
albus <- get_declension("albus","white")
albus
This has two ambiguous or invalid forms for the word I want: albeo, alboris
albus <- albus %>% filter(!LatinStem %in% qq("albeo alboris"))
concordance <- add_row(concordance, albus)
concordance
candidus <- get_declension("candidus","white")
candidus
concordance <- add_row(concordance, candidus)
concordance
ruber <- get_declension("ruber","red")
ruber
This has a single ambiguous stem
ruber <- ruber %>% filter(!LatinStem %in% qq("rubrus"))
concordance <- add_row(concordance, ruber)
concordance
flavus <- get_declension("flavus","yellow")
flavus
flavus <- flavus %>% filter(!LatinStem %in% qq("flo flaveo"))
concordance <- add_row(concordance, flavus)
concordance
fulvus <- get_declension("fulvus","yellow")
fulvus
concordance <- add_row(concordance, fulvus)
concordance
viridis <- get_declension("viridis","green")
viridis
viridis <- viridis %>% filter(!LatinStem %in% qq("virido"))
concordance <- add_row(concordance, viridis)
concordance
caeruleus <- get_declension("caeruleus","blue")
caeruleus
This is interesting. The word ‘caeruleus’ forms its comparative and superlative as compounds with helping adverbs while preserving the positive forms unchanged.
## [1] "caeruleus caerulea caeruleum caerulei caeruleae caerulei caeruleo caeruleae caeruleo caeruleum caeruleam caeruleum caerulee caerulea caeruleum caeruleo caerulea caeruleo caerulei caeruleae caerulea caeruleorum caerulearum caeruleorum caeruleis caeruleis caeruleis caeruleos caeruleas caerulea caerulei caeruleae caerulea caeruleis caeruleis caeruleis magis caeruleus magis caerulea magis caeruleum magis caerulei magis caeruleae magis caerulei magis caeruleo magis caeruleae magis caeruleo magis caeruleum magis caeruleam magis caeruleum magis caeruleo magis caerulea magis caeruleo magis caerulei magis caeruleae magis caerulea magis caeruleorum magis caerulearum magis caeruleorum magis caeruleis magis caeruleis magis caeruleis magis caeruleos magis caeruleas magis caerulea magis caeruleis magis caeruleis magis caeruleis maxime caeruleus maxime caerulea maxime caeruleum maxime caerulei maxime caeruleae maxime caerulei maxime caeruleo maxime caeruleae maxime caeruleo maxime caeruleum maxime caeruleam maxime caeruleum maxime caeruleo maxime caerulea maxime caeruleo maxime caerulei maxime caeruleae maxime caerulea maxime caeruleorum maxime caerulearum maxime caeruleorum maxime caeruleis maxime caeruleis maxime caeruleis maxime caeruleos maxime caeruleas maxime caerulea maxime caeruleis maxime caeruleis maxime caeruleis"
So we'll simply omit the helping-adverb stems from the blue entries in our concordance, since the unadorned stem of ‘caeruleus’ will still appear wherever the word is used.
caeruleus <- caeruleus %>% filter( ! grepl("m.*",LatinStem))
caeruleus
concordance <- add_row(concordance, caeruleus)
concordance
The next word does the same thing, so I loaded only the positive-degree declension into its file.
croceus <- get_declension("croceus","orange")
croceus
concordance <- add_row(concordance, croceus)
concordance
purpureus <- get_declension("purpureus","purple")
purpureus
concordance <- add_row(concordance, purpureus)
concordance
amicus <- get_declension("amicus","kind")
amicus <- amicus %>% filter(!LatinStem %in% qq("amicvi amicio"))
concordance<- add_row(concordance,amicus)
concordance
The declension of the word “dexter” is different: it includes alternate forms, separated by ‘/’, for several cases. So I refactored the declension utilities, adding ‘/’ to the set of delimiters that are handled.
## [1] "SINGULAR m f n Nom. dexter dextra dextrum / dexterum Gen. dextri dextrae dextri Dat. dextro / dextero dextrae dextro / dextero Acc. dextrum / dexterum dextram dextrum / dexterum Abl. dextro / dextero dextra dextro / dextero PLURAL m f n Nom. dextri dextrae dextra Gen. dextrorum dextrarum dextrorum Dat. dextris dextris dextris Acc. dextros dextras dextra Abl. dextris dextris dextris Comparative SINGULAR m f n Nom. dexterior dexterius Gen. dexterioris dexterioris Dat. dexteriori dexteriori Acc. dexteriorem dexterius Abl. dexteriore dexteriore PLURAL m f n Nom. dexteriores dexteriora Gen. dexteriorum dexteriorum Dat. dexterioribus dexterioribus Acc. dexteriores dexteriora Abl. dexterioribus dexterioribus Superlative SINGULAR m f n Nom. dextumus / dextimus dextuma / dextima dextumum / dextimum Gen. dextumi / dextimi dextumae / dextimae dextumi / dextimi Dat. dextumo / dextimo dextumae / dextimae dextumo / dextimo Acc. dextumum / dextimum dextumam / dextimam dextumum / dextimum Voc. dextume dextuma / dextima dextumum / dextimum Abl. dextumo / dextimo dextuma / dextima dextumo / dextimo PLURAL m f n Nom. dextumi / dextimi dextumae / dextimae dextuma / dextima Gen. dextumorum / dextimorum dextumarum / dextimarum dextumorum / dextimorum Dat. dextumis / dextimis dextumis / dextimis dextumis / dextimis Acc. dextumos / dextimos dextumas / dextimas dextuma / dextima Voc. dextumi / dextimi dextumae / dextimae dextuma / dextima Abl. dextumis / dextimis dextumis / dextimis dextumis / dextimis "
declension <- string_declension("dexter")
declension
## [1] "dexter dextra dextrum dexterum dextri dextrae dextri dextro dextero dextrae dextro dextero dextrum dexterum dextram dextrum dexterum dextro dextero dextra dextro dextero dextri dextrae dextra dextrorum dextrarum dextrorum dextris dextris dextris dextros dextras dextra dextris dextris dextris dexterior dexterius dexterioris dexterioris dexteriori dexteriori dexteriorem dexterius dexteriore dexteriore dexteriores dexteriora dexteriorum dexteriorum dexterioribus dexterioribus dexteriores dexteriora dexterioribus dexterioribus dextumus dextimus dextuma dextima dextumum dextimum dextumi dextimi dextumae dextimae dextumi dextimi dextumo dextimo dextumae dextimae dextumo dextimo dextumum dextimum dextumam dextimam dextumum dextimum dextume dextuma dextima dextumum dextimum dextumo dextimo dextuma dextima dextumo dextimo dextumi dextimi dextumae dextimae dextuma dextima dextumorum dextimorum dextumarum dextimarum dextumorum dextimorum dextumis dextimis dextumis dextimis dextumis dextimis dextumos dextimos dextumas dextimas dextuma dextima dextumi dextimi dextumae dextimae dextuma dextima dextumis dextimis dextumis dextimis dextumis dextimis"
That worked so proceeding as before.
dexter <- get_declension("dexter","right")
dexter
All of these are valid so adding to the concordance.
concordance<- add_row(concordance,dexter)
concordance
sinister <- get_declension("sinister","left")
sinister
concordance<- add_row(concordance,sinister)
concordance
Malus, or ‘evil, bad’, turns out to be irregular: it changes its stem entirely in the comparative and superlative (peior, pessimus). I find it amusing that evil doesn't follow the rules.
print(string_declension("malus"))
## [1] "malus mala malum mali malae mali malo malae malo malum malam malum male mala malum malo mala malo mali malae mala malorum malarum malorum malis malis malis malos malas mala mali malae mala malis malis malis peior peius peioris peioris peiori peiori peiorem peius peiore peiore peiores peiora peiorum peiorum peioribus peioribus peiores peiora peioribus peioribus pessimus pessima pessimum pessimi pessimae pessimi pessimo pessimae pessimo pessimum pessimam pessimum pessime pessima pessimum pessimo pessima pessimo pessimi pessimae pessima pessimorum pessimarum pessimorum pessimis pessimis pessimis pessimos pessimas pessima pessimi pessimae pessima pessimis pessimis pessimis"
The irregular declension doesn't appear to cause any problems for the tools developed so far, so we'll proceed as usual.
malus <- get_declension("malus","bad")
malus
malus <- malus %>% filter(! LatinStem %in% qq("peioro malvi ") )
concordance <- add_row (concordance, malus)
concordance
bonus <- get_declension("bonus","good")
bonus
bonus <- bonus %>% filter(! LatinStem %in% qq("melioro melium") )
concordance <- add_row (concordance, bonus)
concordance
diligens <- get_declension("diligens","diligent")
diligens
diligens <- diligens %>% filter (! LatinStem %in% qq("diligo"))
concordance <- add_row(concordance,diligens)
concordance
We run into another issue with the irregular declension of multus, where certain forms don't exist and are represented by ‘-’ in the declension file. So we add another rule to remove ‘-’ where it appears.
declension <- read_file("Data/multus.declension.raw")
declension<- str_replace_all(declension, ".*Adverb:.*\\r\\n","")
declension <- str_replace_all(declension, "\\r\\n|\\t|,"," ")
declension <- str_replace_all(declension, "\\s+|\\r\\n|\\t|,"," ")
declension
## [1] "SINGULAR m f n Nom. multus multa multum Gen. multi multae multi Dat. multo multae multo Acc. multum multam multum Voc. multe multa multum Abl. multo multa multo PLURAL m f n Nom. multi multae multa Gen. multorum multarum multorum Dat. multis multis multis Acc. multos multas multa Voc. multi multae multa Abl. multis multis multis Comparative SINGULAR m f n Nom. plus plus Gen. pluris pluris Dat. - - Acc. plus plus Abl. - - PLURAL m f n Nom. plures plura / pluria Gen. plurium plurium Dat. pluribus pluribus Acc. plures plura / pluria Abl. pluribus pluribus Superlative SINGULAR m f n Nom. plurimus plurima plurimum Gen. plurimi plurimae plurimi Dat. plurimo plurimae plurimo Acc. plurimum plurimam plurimum Voc. plurime plurima plurimum Abl. plurimo plurima plurimo PLURAL m f n Nom. plurimi plurimae plurima Gen. plurimorum plurimarum plurimorum Dat. plurimis plurimis plurimis Acc. plurimos plurimas plurima Voc. plurimi plurimae plurima Abl. plurimis plurimis plurimis"
Added a rule to our regular expression
print(string_declension("multus"))
## [1] "multus multa multum multi multae multi multo multae multo multum multam multum multe multa multum multo multa multo multi multae multa multorum multarum multorum multis multis multis multos multas multa multi multae multa multis multis multis plus plus pluris pluris plus plus plures plura pluria plurium plurium pluribus pluribus plures plura pluria pluribus pluribus plurimus plurima plurimum plurimi plurimae plurimi plurimo plurimae plurimo plurimum plurimam plurimum plurime plurima plurimum plurimo plurima plurimo plurimi plurimae plurima plurimorum plurimarum plurimorum plurimis plurimis plurimis plurimos plurimas plurima plurimi plurimae plurima plurimis plurimis plurimis"
multus <- get_declension("multus","many")
multus
And proceeded as usual
multus <- multus %>% filter (! LatinStem %in% qq("pluvi"))
concordance <- add_row(concordance,multus)
concordance
solus <- get_declension("solus","only")
solus
solus <- solus %>% filter (! LatinStem %in% qq("solvi soleo"))
concordance <- add_row(concordance,solus)
concordance
declension <- read_file("Data/nullus.declension.raw")
declension<- str_replace_all(declension, ".*Adverb:.*\\r\\n","")
declension <- str_replace_all(declension, "\\r\\n|\\t|,"," ")
declension <- str_replace_all(declension, "\\s+|\\r\\n|\\t|,"," ")
declension
## [1] "SINGULAR m f n Nom. nullus nulla nullum Gen. nullius / -i nullius / -ae nullius / -i Dat. nulli / -o nulli / -ae nulli / -o Acc. nullum nullam nullum Voc. nulle nulla nullum Abl. nullo nulla nullo PLURAL m f n Nom. nulli nullae nulla Gen. nullorum nullarum nullorum Dat. nullis nullis nullis Acc. nullos nullas nulla Voc. nulli nullae nulla Abl. nullis nullis nullis"
Another twist: this time it will take manual intervention to clean the declension. First we will collect all the words, remove the extraneous endings, then add the missing forms, stem the result into a mini-concordance, and verify the valid forms. We will also modify get_declension to accept the forms as a list.
Regular expressions were not matching the macron characters as expected, so other methods were used to remove “ī” and “ō” from the list: their positions were found by matching a single-character regex wildcard, and lapply/sapply with logical indexing was then used to drop those entries.
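(An alternative that might sidestep the character-matching problem entirely, assuming the stringi package is available, would be to transliterate the macrons to plain ASCII before splitting the declension; a sketch, not what was done here:)
# hypothetical: convert “ī” to “i”, “ō” to “o”, etc., so ordinary regexes match
declension_ascii <- stringi::stri_trans_general(string_declension("nullus"), "Latin-ASCII")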
declension <- qq(string_declension("nullus"))
declension
## [1] "nullus" "nulla" "nullum" "nullius" "i" "nullius"
## [7] "ae" "nullius" "i" "nulli" "o" "nulli"
## [13] "ae" "nulli" "o" "nullum" "nullam" "nullum"
## [19] "nulle" "nulla" "nullum" "nullo" "nulla" "nullo"
## [25] "nulli" "nullae" "nulla" "nullorum" "nullarum" "nullorum"
## [31] "nullis" "nullis" "nullis" "nullos" "nullas" "nulla"
## [37] "nulli" "nullae" "nulla" "nullis" "nullis" "nullis"
# remove single character entries
remove <- lapply (declension, function(ch) grep ("^.$", ch) )
! sapply(remove,function(x) length(x) >0 )
## [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
## [13] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] TRUE TRUE TRUE TRUE TRUE TRUE
declension <- declension[! sapply(remove,function(x) length(x) >0 )]
declension
## [1] "nullus" "nulla" "nullum" "nullius" "nullius" "ae"
## [7] "nullius" "nulli" "nulli" "ae" "nulli" "nullum"
## [13] "nullam" "nullum" "nulle" "nulla" "nullum" "nullo"
## [19] "nulla" "nullo" "nulli" "nullae" "nulla" "nullorum"
## [25] "nullarum" "nullorum" "nullis" "nullis" "nullis" "nullos"
## [31] "nullas" "nulla" "nulli" "nullae" "nulla" "nullis"
## [37] "nullis" "nullis"
Next we remove the stray ‘ae’ entries and then add back the alternate forms.
remove <- lapply (declension, function(ch) grep ("^ae$", ch) )
! sapply(remove,function(x) length(x) >0 )
## [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] TRUE TRUE
declension <- declension[! sapply(remove,function(x) length(x) >0 ) ]
declension <- c(declension, qq("nūllī nūllae nūllī nūllō nūllae nūllō"))
declension
## [1] "nullus" "nulla" "nullum" "nullius" "nullius" "nullius"
## [7] "nulli" "nulli" "nulli" "nullum" "nullam" "nullum"
## [13] "nulle" "nulla" "nullum" "nullo" "nulla" "nullo"
## [19] "nulli" "nullae" "nulla" "nullorum" "nullarum" "nullorum"
## [25] "nullis" "nullis" "nullis" "nullos" "nullas" "nulla"
## [31] "nulli" "nullae" "nulla" "nullis" "nullis" "nullis"
## [37] "nulli" "nullae" "nulli" "nullo" "nullae" "nullo"
nullus <- get_declension("nullus","none",forms = declension)
nullus
concordance<-add_row(concordance,nullus)
concordance
unus <- get_declension("unus","one")
unus
unus <- unus %>% filter(!LatinStem %in% qq("unii unio"))
concordance <- add_row(concordance,unus)
concordance
celer <- get_declension("celer","swift")
celer
celer <- celer %>% filter(!LatinStem %in% qq("celo celero"))
concordance <- add_row(concordance,celer)
concordance
fortis <- get_declension("fortis","strong")
fortis
concordance <- add_row(concordance,fortis)
concordance
vetus <- get_declension("vetus","old")
vetus
vetus <- vetus %>% filter(!LatinStem %in% qq("i ii vetero vetvi vetero veto veterrimus veterrimvs vetustus"))
vetus
concordance<-add_row(concordance,vetus)
concordance
Adding a lookup function:
lookup = function(word) {
concordance %>%
filter(LatinStem == word) %>%
select(English)
}
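A quick sanity check of the lookup (assuming, as above, that ‘ater’ made it into the concordance as a stem for black and that ‘canis’ was never added):
lookup("ater")   # expected: a one-row tibble with English == "black"
lookup("canis")  # a stem not in the concordance returns a zero-row tibble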
This completes our proposed concordance. Next we'll load an original Latin manuscript, tokenize it, stem the tokens, join our concordance to the result, and then perform a frequency analysis. This will be our baseline for comparing translations.
Using the gutenbergr library, I will load the text of the Aeneid in Latin and several of its English translations. I will also load “The Jungle” by Upton Sinclair for comparison.
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
TranslationAeneid228 <- gutenberg_download(228)
TranslationAeneid18466 <- gutenberg_download(18466)
TranslationAeneid22456 <- gutenberg_download(22456)
TranslationAeneid29358 <- gutenberg_download(29358)
TranslationAeneid49844 <- gutenberg_download(49844)
# for comparison
USTheJungle <- gutenberg_download(120)
Now that we have a source to which the concordance applies and five different translations, we can start to ask some questions. First, how do they compare in raw size?
Text_sizes <- tibble (title = "TheLatinAeneid", wordcount = TheLatinAeneid %>% unnest_tokens(word,text)%>% nrow())
Text_sizes <- add_row(Text_sizes,title = "TranslationAeneid228", wordcount = TranslationAeneid228 %>% unnest_tokens(word,text)%>% nrow())
Text_sizes <- add_row(Text_sizes,title = "TranslationAeneid18466", wordcount = TranslationAeneid18466 %>% unnest_tokens(word,text)%>% nrow())
Text_sizes <- add_row(Text_sizes,title = "TranslationAeneid22456", wordcount = TranslationAeneid22456 %>% unnest_tokens(word,text)%>% nrow())
Text_sizes <- add_row(Text_sizes,title = "TranslationAeneid29358", wordcount = TranslationAeneid29358 %>% unnest_tokens(word,text)%>% nrow())
Text_sizes <- add_row(Text_sizes,title = "TranslationAeneid49844", wordcount = TranslationAeneid49844 %>% unnest_tokens(word,text)%>% nrow())
Text_sizes <- add_row(Text_sizes,title = "USTheJungle", wordcount = USTheJungle %>% unnest_tokens(word,text)%>% nrow())
Text_sizes
Text_sizes %>% ggplot( aes(title,wordcount))+
geom_col(fill = ifelse(Text_sizes$title =="TheLatinAeneid", 'red','black'))+coord_flip()+theme(axis.text.x = element_text(angle = 90))+
labs(title= 'Total Word Count Comparison among the works')
Right away we can see that the amount of text in each edition varies by a significant amount. This can be explained by differences in front matter, different approaches to translation, and additional material inserted into an edition, such as commentary.
So the next thing to attempt: can we see how faithful each edition is to the original? To do this we will attempt to construct a word frequency ‘yardstick’ from the original Latin work and apply it to each of the editions in turn.
Analysis <-TheLatinAeneid %>% unnest_tokens(word,text)
Analysis$LatinStem<- Analysis$word %>% hunspell_stem( dict=latin ) %>% lapply( function(x) paste(x, collapse = " ")) %>% unlist()
Analysis<-Analysis %>% mutate(English="")
#foreach row of analysis
for( row in 1:nrow(Analysis)){
stem <- Analysis[row, "LatinStem"]
if ( ! is_tibble(stem)){
print("not a tibble")
}else{
#print(glue("row:{row}:tibble:{stem}"))
#glimpse(stem)
stems = qq(stem %>% pull())
if (length(stems) < 1) {
#print ("skipping")
next
}
for(LatinStem in stems) {
english <- lookup(LatinStem) %>% pull()
if ( length(english ) == 0 ){
next
}else{
Analysis[row,"English"] = english
}
}#end for
}#Else-fi
}#For row
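As an aside, the row-by-row loop above is straightforward but slow on a text of this size; a vectorized sketch of the same mapping, using a join against the concordance (an alternative, not what was run for the results below), might look like this:
# hypothetical vectorized alternative: unnest the list-column of stems and join the concordance
Analysis_alt <- TheLatinAeneid %>%
  unnest_tokens(word, text) %>%
  mutate(LatinStem = hunspell_stem(word, dict = latin)) %>%  # list-column of candidate stems
  unnest(LatinStem) %>%
  left_join(concordance, by = "LatinStem") %>%
  filter(!is.na(English)) %>%   # keep only tokens that hit the concordance
  count(English)
# Note: a token whose multiple candidate stems all match the concordance would be
# counted more than once here, unlike the loop, which keeps a single English label per token.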
Now that we have an English word assigned to each Latin token that appears in our concordance, we can perform a word-frequency analysis on the original Latin work. We'll do this for each edition and display them separately. First we will create a list of our concordance words to filter the word-frequency analysis for each work we generate.
wf = Analysis %>% select(English) %>% filter(! English =="" ) %>% count(English)
wf
Then we will build up our WF results row by row, each row representing a different edition.
concordance_words = as.vector( concordance %>% select(English) %>% pull())
wfAnalysis<- wf %>% pivot_wider(names_from = English, values_from = n) %>% mutate(WorkTitle = 'TheLatinAeneid') %>% select(WorkTitle,everything())
wfAnalysis
wf = TranslationAeneid228 %>% unnest_tokens(word,text) %>% filter( word %in% concordance_words)%>% select(word) %>% count(word)
wf <- wf %>% pivot_wider(names_from = word, values_from = n) %>% mutate(WorkTitle = 'TranslationAeneid228') %>% select(WorkTitle,everything())
wfAnalysis <- add_row(wfAnalysis,wf)
wf = TranslationAeneid18466 %>% unnest_tokens(word,text) %>% filter( word %in% concordance_words)%>% select(word) %>% count(word)
wf <- wf %>% pivot_wider(names_from = word, values_from = n) %>% mutate(WorkTitle = 'TranslationAeneid18466') %>% select(WorkTitle,everything())
wfAnalysis <- add_row(wfAnalysis,wf)
wf = TranslationAeneid22456 %>% unnest_tokens(word,text) %>% filter( word %in% concordance_words)%>% select(word) %>% count(word)
wf <- wf %>% pivot_wider(names_from = word, values_from = n) %>% mutate(WorkTitle = 'TranslationAeneid22456') %>% select(WorkTitle,everything())
wfAnalysis <- add_row(wfAnalysis,wf)
wf = TranslationAeneid29358 %>% unnest_tokens(word,text) %>% filter( word %in% concordance_words)%>% select(word) %>% count(word)
wf <- wf %>% pivot_wider(names_from = word, values_from = n) %>% mutate(WorkTitle = 'TranslationAeneid29358') %>% select(WorkTitle,everything())
wfAnalysis <- add_row(wfAnalysis,wf)
wf = TranslationAeneid49844 %>% unnest_tokens(word,text) %>% filter( word %in% concordance_words)%>% select(word) %>% count(word)
wf <- wf %>% pivot_wider(names_from = word, values_from = n) %>% mutate(WorkTitle = 'TranslationAeneid49844') %>% select(WorkTitle,everything())
wfAnalysis <- add_row(wfAnalysis,wf)
wf = USTheJungle %>% unnest_tokens(word,text) %>% filter( word %in% concordance_words)%>% select(word) %>% count(word)
wf <- wf %>% pivot_wider(names_from = word, values_from = n) %>% mutate(WorkTitle = 'USTheJungle') %>% select(WorkTitle,everything())
wfAnalysis <- add_row(wfAnalysis,wf)
#clean up for radar graph, turn NA to 0
wfAnalysis_radar <-wfAnalysis
wfAnalysis_radar[is.na(wfAnalysis_radar)]<-0
wfAnalysis_radar
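The six nearly identical blocks above could also be wrapped in a small helper (a sketch based on the same pipeline, not how the table above was actually built):
wf_row <- function(df, work_title) {
  df %>%
    unnest_tokens(word, text) %>%
    filter(word %in% concordance_words) %>%
    count(word) %>%
    pivot_wider(names_from = word, values_from = n) %>%
    mutate(WorkTitle = work_title) %>%
    select(WorkTitle, everything())
}
# e.g. wfAnalysis <- bind_rows(wfAnalysis, wf_row(TranslationAeneid228, "TranslationAeneid228"))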
With our WF table complete we can graph the WF chart for each work.
#Convert data to long form for ggplot tools
longwfAnalysis = wfAnalysis_radar %>% pivot_longer(!WorkTitle, names_to = "words", values_to= "count")
works = wfAnalysis_radar %>% select(WorkTitle) %>% pull()
works
## [1] "TheLatinAeneid" "TranslationAeneid228" "TranslationAeneid18466"
## [4] "TranslationAeneid22456" "TranslationAeneid29358" "TranslationAeneid49844"
## [7] "USTheJungle"
colors <- c('red','green','blue','purple','black','orange','yellow' )
for ( i in 1:length(works)) {
WFtitle = works[i]
p <- ggplot(longwfAnalysis %>% filter(WorkTitle == WFtitle), aes(words,count,fill=WorkTitle)) +
geom_col()+
scale_fill_manual(values = c(colors[i]))+
theme(axis.text.x = element_text(angle = 90)) +
labs(title = paste(WFtitle,"- Concordance Word Frequency Analysis" ))
print(p)
}
This shows us some quantitative differences between the editions of the work, but we don't yet have a good sense of the distance between them.
Next we can look at the proportional occurrence of each word in the concordance using a stacked bar graph, comparing each variable on a normalized scale. An ideal or close translation would have the same proportion as the original for each variable.
#Set order to match other graphs and color scheme
x<-factor(unique(longwfAnalysis$WorkTitle),ordered = T)
fct_shift(x)
## [1] TheLatinAeneid TranslationAeneid228 TranslationAeneid18466
## [4] TranslationAeneid22456 TranslationAeneid29358 TranslationAeneid49844
## [7] USTheJungle
## 7 Levels: TranslationAeneid18466 < ... < TheLatinAeneid
ggplot(longwfAnalysis, aes(words,count,fill=WorkTitle)) +
geom_col(position = "fill")+
#scale_fill_discrete(name= "Work Title", labels = x) +
scale_fill_manual(values = c(colors))+
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Concordance Word Frequency Proportion Analysis" )
Using a fixed order of the concordance as a ‘lens’, we can view the shape of each translation compared to the original. Here we first look at the shape of each work and then compare each one pairwise with the original Latin work.
Ideally the resulting shapes would be congruent with the original at every point; the ‘best’ translation should be the best fit to the shape of the original text. This comparison uses the absolute counts, presented in a radar chart.
Again, for this analysis the shape and congruence matter more than the actual numbers; this holistic view sidesteps some of the cognitive difficulty of interpreting individual quantitative values. A weakness of the approach is that the data must be in a particular order for the resulting polygons to be comparable.
ggRadar(data=wfAnalysis_radar,aes(color= WorkTitle, facet=WorkTitle),interactive=T,alpha = .5, rescale = F, size = 1)
#wfAnalysis_radar
#colors to use in the generated graphs in row-order
colors <- c('red','green','blue','purple','black','orange','yellow' )
for ( i in 2:7){
# Positional retrieval of the Title. We built the table so we know the 1st one is the original
Original <- wfAnalysis_radar %>%slice(1) %>% select(WorkTitle) %>% pull()
Translation <- wfAnalysis_radar %>%slice(i) %>% select(WorkTitle) %>% pull()
p <- ggradar(
wfAnalysis_radar %>% slice(1,i),
values.radar = c("0", "140", "280"),
grid.min = 0,
grid.mid = 140,
grid.max = 280,
group.line.width = 2,
group.point.size = 1,
group.colours = qq(paste(colors[1], colors[i] ))
) +
labs( title = paste ( Original, "vs", Translation ))+
theme(plot.title = element_text(hjust = 1.0, vjust=2.12))
#
print(p)
}
Congruence gives a good visual idea of the differences between translations but doesn’t give us a quantitative value with which to judge the ‘closeness’ to the original that we are after. We proceed as follows:
d_{word} = \sqrt{\sum_{x=1}^{21} (t_{w_{x}} - l_{w_{x}})^2} and d_{text} = \sqrt{((t_{n} - l_{n})/1000)^2} = |t_{n} - l_{n}|/1000, where t_{w_{x}} is the translation's concordance count for word x, l_{w_{x}} is the Latin edition's concordance count for word x, and t_{n}, l_{n} are the total word counts of the translation and the Latin edition respectively.
#will need
#Text sizes
#wfAnalysis_radar
# get distance (input an WF analysis original as row 1)
d_word = function( WF, translation) {
org = as.numeric(WF %>% select(everything(), -c(WorkTitle)) %>% slice(1))
trans = as.numeric(WF %>% select(everything(), -c(WorkTitle)) %>% slice(translation))
dis = 0
#print(trans)
#print(org)
dis = sum ((trans - org)^2)
sqrt(dis)
}
d_text = function( WF, translation) {
#print(Text_sizes )
org = semi_join(Text_sizes,wfAnalysis_radar, by=c("title"="WorkTitle") ) %>% slice(1) %>% select(wordcount) %>% pull()
trans = semi_join(Text_sizes,wfAnalysis_radar, by=c("title"="WorkTitle") ) %>% slice(translation) %>% select(wordcount) %>% pull()
#print (org)
#print(trans)
dis = 0
dis = sum (((trans - org)/1000) ^2)
sqrt(dis)
}
workframe <- tibble(Title= as.character(), x=as.double(), y=as.double())
for (i in 1: nrow(wfAnalysis_radar)) {
y = d_text(wfAnalysis_radar,i)
x = d_word(wfAnalysis_radar,i)
title = wfAnalysis %>% slice(i) %>% select(WorkTitle) %>% pull()
workmaprow <- tibble(Title= title, x=x, y=y)
workframe = workframe %>% add_row(workmaprow)
}
workframe %>% ggplot( aes(x,y,color=Title)) +
geom_point(size=4)+
scale_color_manual(values = c(colors))+
xlab( "21-d Euclidean Distance (concordance observations)")+
ylab( "Euclidean Distance - Total words in Text/1000")+
labs(title = 'Euclidean Distance of Latin Text', subtitle= "Latin vs several translations and an unrelated work")
Now we have a better idea of how similar the translations are to the original, and we can identify “The Jungle” as clearly different from both the original and the other translations. Of note is Translation49844: on examination, more than half of the content of this edition is commentary and assorted unrelated short literary works.
It appears that a collection of common adjectives can give a measure of the ‘reliability’ of a translation using a notion of distance based solely on a word frequency analysis. By combining a rule-based classification system with text analytics we are able to determine which texts are close to the original and can even distinguish a radically different text from both the original and its translations.
A number of things could enhance this investigation. Expanding to multiple editions of the same book in the same language would give us a ‘control’ for the natural divergence that occurs as material is added or revised in each edition.
Additional context clues using NLP could enhance, or even eliminate the need for, a concordance. A broader concordance might capture a more solid signature.
Castagnetto, JM. “Stemming with SnowballC vs hunspell.” GitHub, Feb 12, 2020, https://gist.github.com/jmcastagnetto/3b0776f7558621e5d06a2a0981b20c2e.
Foster, Timothy. “The Latin Dictionary.” latindictionary.wikidot.com, Feb 13, 2018, http://latindictionary.wikidot.com/adjective:ater.
Krüger, Bernd. “Latin Declension Tables.” Cactus2000, 2021, https://latin.cactus2000.de/.
Moffit, Mitchell. “Why The Ancient Greeks Couldn’t See Blue.” YouTube, uploaded by AsapSCIENCE, Nov 20, 2020, https://www.youtube.com/watch?v=D1-WuBbVe2E.
Mohler, Tim. “The 7 Basic Functions of Text Analytics & Text Mining.” Lexalytics.com, Dec 17, 2020, https://www.lexalytics.com/lexablog/text-analytics-functions-explained.
Murzintcev, Nikita. “Karl Zeller’s Variant hunspell Dictionary.” latin-dict.github.io, Nov 11, 2020, https://latin-dict.github.io/docs/hunspell.html.
Petterson, Daniel, and Amelie Rosengren. “Latin Dictionaries. 4 Searchable Latin Dictionaries.” Latinitium, 2020, https://www.latinitium.com/latin-dictionaries?t=lsn6076,lsn6077.